Prediction on the Success of Bank Telemarketing on Term Deposit Subscription

Xiao Liu

Nov 7, 2018

Introduction

This project looks at data from a telemarketing campaign in which a bank tried to get clients to sign up for a term deposit.

The goal is to predict whether a client will subscribe (yes/no) to a term deposit, so that callers can target potential clients more efficiently when promoting the term deposit, leading to more profit for the bank. Given this goal, correctly identifying the clients who will sign up, and avoiding misclassifying them, is extremely important.

Data

The dataset originally contains 4119 observations of 21 variables (response variable included), with no missing values. The explanatory variables include both demographic information and information about contacts from previous marketing campaigns.

The duration variable, as suggested by the author, should not be used for prediction and should only be used as a benchmark. This is because the duration of a call is not known before the call is made; it is only known once the call has ended.

The variable pdays is coded such that the value ‘999’ means that there was no previous contact. Since the majority of the observations have the value ‘999’, pdays was converted for this report into a factor variable named factor_pdays with 4 levels: clients who were never contacted, clients who were last contacted between 0 and 1 week ago, clients who were last contacted between 1 and 2 weeks ago, and clients who were last contacted over 2 weeks ago.
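The recoding described above can be sketched with base R's cut(). This is a minimal illustration on made-up values, not the report's own code: the exact bin edges (7 and 14 days) and the level labels are assumptions, since the report only names the four groups.

```r
# Toy pdays values in days; 999 is the dataset's code for "never contacted".
pdays <- c(999, 3, 7, 10, 14, 21, 999, 0)

# Assumed bin edges: up to 7 days, 8 to 14 days, 15 to 998 days, and 999.
# cut() uses right-closed intervals by default, e.g. (7, 14].
factor_pdays <- cut(
  pdays,
  breaks = c(-Inf, 7, 14, 998, Inf),
  labels = c("0-1 week", "1-2 weeks", "over 2 weeks", "never contacted")
)

# Count how many observations fall into each level.
table(factor_pdays)
```

Treating 999 as its own level keeps the sentinel value from being read as a very long gap in days.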

After removing duration, the dataset used for analysis contains 4119 observations of 20 variables (response variable included).

To get a better understanding of the data, let’s first look at some of the variables in the dataset.

The response variable is y and takes the values Yes and No. The plot shows extremely imbalanced data, with only about 10% of y being “Yes”.

Admin (1012), blue-collar (884), and technician (691) are the most common jobs among the clients who were contacted (numbers in parentheses are the number of clients with each type of job).

University degree (1264) and high school degree (921) are the most common education levels among the clients who were contacted (numbers in parentheses are the number of clients at each education level). Only one of the contacted clients was illiterate.

Here we fit a GBM model in order to find important variables. As shown below, job, month, euribor3m, nr.employed, age, education, and day_of_week are relatively important variables.

Methods

The data was first split into a training dataset and a test dataset using the createDataPartition function with an 80/20 split, so that prediction accuracy could be measured on held-out data. This put 3,296 observations in the training dataset and 823 in the test dataset.
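A split of this kind might look like the following sketch. It runs on a simulated data frame rather than the bank data, so the object names (toy, toy_trn, toy_tst) and the seed are illustrative assumptions; only createDataPartition and the 80/20 proportion come from the report.

```r
library(caret)

set.seed(42)
# Simulated stand-in for the bank data: roughly 10% "yes" responses.
toy <- data.frame(
  y   = factor(sample(c("no", "yes"), 1000, replace = TRUE, prob = c(0.9, 0.1))),
  age = sample(18:90, 1000, replace = TRUE)
)

# createDataPartition samples within each class of y, so the training and
# test pieces keep a similar share of "yes" responses.
trn_idx <- createDataPartition(toy$y, p = 0.80, list = FALSE)
toy_trn <- toy[trn_idx, ]
toy_tst <- toy[-trn_idx, ]

# Sizes of the two pieces and the class shares in each.
c(train = nrow(toy_trn), test = nrow(toy_tst))
prop.table(table(toy_trn$y))
prop.table(table(toy_tst$y))
```

Sampling within classes matters here because the response is heavily imbalanced; a purely random split could leave the test set with very few "yes" cases.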

Models

Three models were considered: Random Forest, Elastic Net, and Logistic Regression.

For each method, three variable subsets were considered:

small_form = formula(y ~ job + month + euribor3m + nr.employed)
med_form = formula(y ~ job + month + euribor3m + nr.employed + age +
                        education + day_of_week)
full_form = formula(y ~ .)

Since there are 9 models in total, including the training code for all of them here would take up too much space, so only two examples are given below. All 9 models were trained using the train function from the caret package, and tuning was done using 5-fold cross-validation.

# Random forest on the small variable subset, tuned by 5-fold CV on ROC.
set.seed(12345)
rf_small = train(small_form, data = bank_add_trn,
                 method = "rf",
                 metric = "ROC",
                 ntree = 1000,
                 trControl = trainControl(
                   method = "cv",
                   number = 5,
                   classProbs = TRUE,
                   summaryFunction = twoClassSummary))

# Elastic net on the medium variable subset, with the same resampling setup.
elastic_median = train(med_form, data = bank_add_trn,
                       method = "glmnet",
                       metric = "ROC",
                       trControl = trainControl(
                         method = "cv",
                         number = 5,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary))

Strategies

A balance function was created and used to choose the optimal cutoff value. This balance function assumes that the average bank deposit is $100 and that a telemarketer makes $15 per hour. In the dataset, the average duration of a call is 4.28 minutes, meaning that, based on the $15/hr assumption, employee compensation per call is $1.07.

A quasi-opportunity cost for an employee was defined as the percent of successful responses x the assumed average bank deposit + the percent of unsuccessful responses x the employee compensation per call, which was calculated to be $20.95.

With these numbers the balance function was created as below. The opportunity cost is considered only when the model predicts “no”, and employee compensation per call is considered only when the model predicts “yes”. With actual data about the average deposit and employee wages, the numbers for this function could be tuned, but the numbers chosen are at least somewhat reasonable.
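The arithmetic behind these two figures can be checked with a few lines of R. This is a sketch under the report's stated assumptions; the variable names and the quasi_cost helper are hypothetical, and the report does not state the success share it plugged in, so the last step only backs out the share that the $20.95 figure implies (roughly 0.20).

```r
hourly_wage  <- 15    # assumed wage in dollars per hour (from the report)
call_minutes <- 4.28  # average call duration in minutes (from the report)
avg_deposit  <- 100   # assumed average deposit in dollars (from the report)

# Compensation per call: the hourly wage scaled to the length of one call.
comp_per_call <- hourly_wage * call_minutes / 60
round(comp_per_call, 2)

# Quasi-opportunity cost as described in the text: a weighted mix of the
# deposit and the per-call compensation. p_success is a hypothetical argument.
quasi_cost <- function(p_success) {
  p_success * avg_deposit + (1 - p_success) * comp_per_call
}

# Success share implied by the reported $20.95, found by inverting quasi_cost().
implied_p <- (20.95 - comp_per_call) / (avg_deposit - comp_per_call)
round(implied_p, 3)
```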

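balance = function(actual, predicted) {
  (100 - 1.07)  * sum(predicted == "yes" & actual == "yes") +  # deposit gained minus call compensation
  (20.95 - 100) * sum(predicted == "no"  & actual == "yes") +  # opportunity cost minus missed deposit
  -1.07         * sum(predicted == "yes" & actual == "no")  +  # call compensation spent, no deposit
  20.95         * sum(predicted == "no"  & actual == "no")     # opportunity cost for a call not made
}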
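One way such a function could be used to pick a cutoff is a simple grid search: classify at each candidate cutoff, score the result with balance, and keep the cutoff with the highest score. The report does not show its search procedure, so the grid, the simulated labels and probabilities, and the names prob_yes, cutoffs, and best_cutoff below are all assumptions made for illustration.

```r
# Same payoff weights as the report's balance function.
balance <- function(actual, predicted) {
  (100 - 1.07)  * sum(predicted == "yes" & actual == "yes") +
  (20.95 - 100) * sum(predicted == "no"  & actual == "yes") +
  -1.07         * sum(predicted == "yes" & actual == "no")  +
  20.95         * sum(predicted == "no"  & actual == "no")
}

set.seed(1)
# Simulated labels and predicted probabilities (not the report's data).
actual   <- sample(c("no", "yes"), 500, replace = TRUE, prob = c(0.9, 0.1))
prob_yes <- ifelse(actual == "yes", rbeta(500, 2, 5), rbeta(500, 1, 9))

# Score every candidate cutoff on an assumed grid of 0.01 to 0.99.
cutoffs <- seq(0.01, 0.99, by = 0.01)
scores <- sapply(cutoffs, function(k) {
  predicted <- ifelse(prob_yes > k, "yes", "no")
  balance(actual, predicted)
})

# Keep the cutoff with the highest balance.
best_cutoff <- cutoffs[which.max(scores)]
best_cutoff
```

Because a missed subscriber is penalized far more heavily than a wasted call, a search like this tends to favor cutoffs well below 0.5 when positives are rare.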

AUC was then calculated for each model as a performance measure. Accuracy and sensitivity were also calculated for comparison.

Results

Method                Subset   Cutoff   Accuracy   Sensitivity   AUC
Random Forest         Small    0.06     0.8287     0.5222        0.6943
Random Forest         Medium   0.06     0.8688     0.5000        0.7070
Random Forest         Full     0.08     0.8712     0.5222        0.7181
Elastic Net           Small    0.21     0.8736     0.5000        0.7098
Elastic Net           Medium   0.21     0.8736     0.5000        0.7098
Elastic Net           Full     0.16     0.8615     0.5444        0.7224
Logistic Regression   Small    0.13     0.8153     0.5778        0.7111
Logistic Regression   Medium   0.20     0.8639     0.5000        0.7043
Logistic Regression   Full     0.12     0.8335     0.5556        0.7116

Of the 9 models above, the Elastic Net model with the full set of variables had the highest AUC (0.7224), so it was chosen as the final model. The following table shows its performance on the test dataset.

Confusion Matrix

Category   Predicted   Truth   Freq
TN         no          no      660
FP         yes         no      73
FN         no          yes     41
TP         yes         yes     49
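The summary metrics quoted for this model can be recomputed directly from the four cell counts. The sketch below uses only the counts in the confusion matrix; the variable names are illustrative.

```r
# Cell counts from the confusion matrix above.
tn <- 660; fp <- 73; fn <- 41; tp <- 49

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of all cases classified correctly
sensitivity <- tp / (tp + fn)                   # share of actual "yes" that were caught
specificity <- tn / (tn + fp)                   # share of actual "no" that were caught
balanced    <- (sensitivity + specificity) / 2  # mean of the two class-wise rates

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced), 4)
```

The mean of sensitivity and specificity comes out at 0.7224, the same value as the AUC reported for this model. That is what one would expect if the AUC had been computed on the thresholded class predictions rather than on the predicted probabilities, although this is an inference from the numbers and not something the report states.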

Discussion

The final model is the Elastic Net model fitted on the full set of variables. The cutoff for this model is 0.16: an observation is classified as “yes” if its predicted probability is larger than 0.16 and as “no” otherwise. More than 86% of the test data is classified correctly, with a sensitivity of 0.5444 and a specificity of 0.9004.

While fitting the different models, I noticed that the choice of method does not change model performance markedly. Instead, it is the choice of cutoff that affects sensitivity and the overall predictions, which are clearly more important for the goal of this project.

The best model found above aims to predict both “yes” and “no” correctly. It gives banks a useful tool for filtering out customers who are unlikely to subscribe and helps callers focus on potential clients who are more likely to subscribe to the product, so that they can target clients more efficiently and help the bank maximize profit from term deposit subscriptions.

Thank You!